Cell2location maps cell types by integrating single cell/nucleus and spatial transcriptomics data. This is achieved by estimating which combination of cell types in which cell abundance could have given the mRNA counts in the spatial data, taking technical effects into account (platform/technology effect, contaminating RNA, unexplained variance).
Given cell type annotation for each cell, the corresponding reference cell type signatures $g_{f,g}$, which represent the average mRNA count of each gene $g$ in each cell type $f={1, .., F}$, can be estimated from sc/snRNA-seq data using 2 provided methods (see below). Cell2location needs untransformed unnormalised spatial mRNA counts as input. You also need to provide cell2location with the expected average cell abundance per location which is used as a prior to guide estimation of absolute cell abundance. This value depends on the tissue and can be estimated by counting nuclei for a few locations in the paired histology image but can be approximate (see paper methods for more guidance).
We provide 2 methods for estimating reference expression signatures of cell types from scRNA-seq data:
1) a statistical method based on Negative Binomial regression. We generally recommend using NB regression, which allows to robustly combine data across technologies and batches, which results in improved spatial mapping accuracy. This notebook shows use a dataset composed on multiple batches and technologies to estimate that.
2) hard-coded computation of per-cluster average mRNA counts for individual genes (scvi.external.cell2location.compute_cluster_averages). When the batch effects are small, this faster hard-coded method of computing per cluster averages provides similarly high accuracy. We also recommend the hard-coded method for non-UMI technologies such as Smart-Seq 2.
import sys
#if branch is stable, will install via pypi, else will install from source
branch = "pyro-cell2location"
user = "vitkl"
IN_COLAB = "google.colab" in sys.modules
if IN_COLAB and branch == "stable":
!pip install --quiet scvi-tools[tutorials]
elif IN_COLAB and branch != "stable":
!pip install --quiet --upgrade jsonschema
!pip install --quiet git+https://github.com/$user/scvi-tools@$branch#egg=scvi-tools[tutorials]
import sys
if not IN_COLAB:
sys.path.insert(1, '/nfs/team205/vk7/sanger_projects/BayraktarLab/scvi-tools/')
import scanpy as sc
import anndata
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
#import cell2location
import scvi
from matplotlib import rcParams
rcParams['pdf.fonttype'] = 42 # enables correct plotting of text
import seaborn as sns
First let's read spatial Visium data from 10X Space Ranger output.
results_folder = './results/lymph_nodes_analysis/'
scvi_run_name = f'{results_folder}/non_amortised/signatures_lr0002_Adam'
sc_results_folder = '/nfs/team205/vk7/sanger_projects/collaborations/fetal_gut_mapping/results/'
regression_model_output = 'v1_ye_signatures_lr0002_Adam'
reg_path = f'{sc_results_folder}regression_model/{regression_model_output}/'
adata_vis = sc.datasets.visium_sge(sample_id="V1_Human_Lymph_Node")
adata_vis.obs['sample'] = list(adata_vis.uns['spatial'].keys())[0]
Variable names are not unique. To make them unique, call `.var_names_make_unique`. Variable names are not unique. To make them unique, call `.var_names_make_unique`.
The signatures are estimated from scRNA-seq data, accounting for batch effect, using a Negative binomial regression model.
## snRNAseq reference (raw counts)
adata_snrna_raw = sc.read(f'{reg_path}sc.h5ad')
# export estimated expression in each cluster
if 'means_per_cluster_mu_fg' in adata_snrna_raw.varm.keys():
inf_aver = adata_snrna_raw.varm['means_per_cluster_mu_fg'][[f'means_per_cluster_mu_fg_{i}'
for i in adata_snrna_raw.uns['mod']['factor_names']]].copy()
else:
inf_aver = adata_snrna_raw.var[[f'means_per_cluster_mu_fg_{i}'
for i in adata_snrna_raw.uns['mod']['factor_names']]].copy()
inf_aver.columns = adata_snrna_raw.uns['mod']['factor_names']
inf_aver.iloc[0:5, 0:5]
| Activated CD4 T | Activated CD8 T | Adult Glia | BEST2+ Goblet cell | BEST4+ epithelial | |
|---|---|---|---|---|---|
| ENSG00000187634 | 0.000475 | 0.000341 | 0.008211 | 0.001663 | 0.001148 |
| ENSG00000188976 | 0.056281 | 0.082341 | 0.061379 | 0.256258 | 0.149516 |
| ENSG00000187583 | 0.002949 | 0.016270 | 0.003423 | 0.007914 | 0.001061 |
| ENSG00000188290 | 0.041080 | 0.058555 | 0.128032 | 0.130205 | 1.816961 |
| ENSG00000187608 | 0.123889 | 0.235275 | 0.197328 | 0.110768 | 0.701299 |
# rename genes to ENSEMBL
adata_vis.var['SYMBOL'] = adata_vis.var_names
adata_vis.var_names = adata_vis.var['gene_ids']
adata_vis.var_names.name = None
# find shared genes and subset both anndata and reference signatures
intersect = np.intersect1d(adata_vis.var_names, inf_aver.index)
adata_vis = adata_vis[:, intersect].copy()
inf_aver = inf_aver.loc[intersect, :].copy()
# prepare anndata for scVI model
scvi.data.setup_anndata(adata=adata_vis, batch_key="sample")
scvi.data.view_anndata_setup(adata_vis)
INFO Using batches from adata.obs["sample"] INFO No label_key inputted, assuming all cells have same label INFO Using data from adata.X INFO Computing library size prior per batch INFO Successfully registered anndata object containing 4035 cells, 14349 vars, 1 batches, 1 labels, and 0 proteins. Also registered 0 extra categorical covariates and 0 extra continuous covariates. INFO Please do not further modify adata until model is trained.
Anndata setup with scvi-tools version 0.0.0.
Data Summary ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓ ┃ Data ┃ Count ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩ │ Cells │ 4035 │ │ Vars │ 14349 │ │ Labels │ 1 │ │ Batches │ 1 │ │ Proteins │ 0 │ │ Extra Categorical Covariates │ 0 │ │ Extra Continuous Covariates │ 0 │ └──────────────────────────────┴───────┘
SCVI Data Registry ┏━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ Data ┃ scvi-tools Location ┃ ┡━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩ │ X │ adata.X │ │ batch_indices │ adata.obs['_scvi_batch'] │ │ local_l_mean │ adata.obs['_scvi_local_l_mean'] │ │ local_l_var │ adata.obs['_scvi_local_l_var'] │ │ labels │ adata.obs['_scvi_labels'] │ └───────────────┴─────────────────────────────────┘
Label Categories ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓ ┃ Source Location ┃ Categories ┃ scvi-tools Encoding ┃ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩ │ adata.obs['_scvi_labels'] │ 0 │ 0 │ └───────────────────────────┴────────────┴─────────────────────┘
Batch Categories ┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓ ┃ Source Location ┃ Categories ┃ scvi-tools Encoding ┃ ┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩ │ adata.obs['sample'] │ V1_Human_Lymph_Node │ 0 │ └─────────────────────┴─────────────────────┴─────────────────────┘
# create and train the model
mod = scvi.external.Cell2location(
adata_vis, cell_state_df=inf_aver,
# the expected average cell abundance - user-provided, tissue-dependent hyper-prior:
N_cells_per_location=30)
import pyro
mod.train(max_epochs=20000,
# train using full data (batch_size=None)
batch_size=None,
# use all data points in training because
# we need to estimate cell abundance at all locations
train_size=1,
#plan_kwargs={'optim': pyro.optim.ClippedAdam(optim_args={'lr': 0.002, 'clip_norm': 10})},
plan_kwargs={'optim': pyro.optim.Adam(optim_args={'lr': 0.002})},
use_gpu=True)
# plot ELBO loss history during training, removing first 100 epochs from the plot
mod.plot_history(5000)
plt.legend(labels=['full data training']);
GPU available: True, used: True TPU available: False, using: 0 TPU cores /nfs/team283/vk7/software/miniconda3farm5/envs/scvi-env2/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py:69: UserWarning: you passed in a val_dataloader but have no validation_step. Skipping val loop warnings.warn(*args, **kwargs) LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Epoch 20000/20000: 100%|██████████| 20000/20000 [52:13<00:00, 6.38it/s, v_num=1, elbo_train=5.13e+7]
# In this section, we export the estimated cell abundance (summary of the posterior distribution).
adata_vis = mod.export_posterior(
adata_vis, sample_kwargs={'num_samples': 1000, 'batch_size': 2020, 'use_gpu': True}
)
# Save model
mod.save(f"{scvi_run_name}_v0_gut_pub_ref_20kiter_c2l", overwrite=True)
# can be loaded later like this:
# mod = scvi.external.cell2location.Cell2location.load(f"{scvi_run_name}_c2l", adata_vis)
# Save anndata object with results
adata_file = f"{scvi_run_name}_v0_gut_pub_ref_20kiter_c2l/sp.h5ad"
adata_vis.write(adata_file)
adata_file
Sampling local variables, batch: 100%|██████████| 2/2 [00:48<00:00, 24.21s/it] Sampling global variables, sample: 100%|██████████| 999/999 [00:28<00:00, 34.92it/s]
... storing 'sample' as categorical ... storing 'feature_types' as categorical ... storing 'genome' as categorical ... storing 'SYMBOL' as categorical
'./results/lymph_nodes_analysis//non_amortised/signatures_lr0002_Adam_v0ref_20kiter_c2l/sp.h5ad'
# Save model
mod.save(f"{scvi_run_name}_v0_gut_pub_ref_20kiter_c2l", overwrite=True)
# can be loaded later like this:
# mod = scvi.external.cell2location.Cell2location.load(f"{scvi_run_name}_c2l", adata_vis)
# Save anndata object with results
adata_file = f"{scvi_run_name}_v0_gut_pub_ref_20kiter_c2l/sp.h5ad"
adata_vis.write(adata_file)
adata_file
'./results/lymph_nodes_analysis//non_amortised/signatures_lr0002_Adam_v0_gut_pub_ref_20kiter_c2l/sp.h5ad'
# Examine reconstruction accuracy to assess if there are any issues with mapping
# the plot should be roughly diagonal, strong deviations will signal problems
mod.plot_QC()
def select_slide(adata, s, s_col='sample'):
r""" This function selects the data for one slide from the spatial anndata object.
:param adata: Anndata object with multiple spatial experiments
:param s: name of selected experiment
:param s_col: column in adata.obs listing experiment name for each location
"""
slide = adata[adata.obs[s_col].isin([s]), :].copy()
s_keys = list(slide.uns['spatial'].keys())
s_spatial = np.array(s_keys)[[s in k for k in s_keys]][0]
slide.uns['spatial'] = {s_spatial: slide.uns['spatial'][s_spatial]}
return slide
# add 5% quantile, representing confident cell abundance, 'at least this amount is present',
# to adata.obs with nice names for plotting
adata_vis.obs[adata_vis.uns['mod']['factor_names']] = adata_vis.obsm['q05_cell_abundance_w_sf']
# plot in spatial coordinates
with mpl.rc_context({'axes.facecolor': 'black',
'figure.figsize': [4.5, 5]}):
# select one slide
slide = select_slide(adata_vis, 'V1_Human_Lymph_Node')
sc.pl.spatial(slide, cmap='magma',
# show first 8 cell types
color=adata_vis.uns['mod']['factor_names'][0:8],
ncols=4, size=1.3,
img_key='hires',
# limit color scale at 99.2% quantile of cell abundance
vmin=0, vmax='p99.2'
)
/nfs/team283/vk7/software/miniconda3farm5/envs/scvi-env2/lib/python3.7/site-packages/pandas/core/arrays/categorical.py:2487: FutureWarning: The `inplace` parameter in pandas.Categorical.remove_unused_categories is deprecated and will be removed in a future version. res = method(*args, **kwargs)
# plot in spatial coordinates for all cell types
with mpl.rc_context({'axes.facecolor': 'black',
'figure.figsize': [4.5, 5]}):
# select one slide
slide = select_slide(adata_vis, 'V1_Human_Lymph_Node')
sc.pl.spatial(slide, cmap='magma',
color=adata_vis.uns['mod']['factor_names'],
ncols=4, size=1.3,
img_key='hires',
# limit color scale at 99.2% quantile of cell abundance
vmin=0, vmax='p99.2'
)
/nfs/team283/vk7/software/miniconda3farm5/envs/scvi-env2/lib/python3.7/site-packages/pandas/core/arrays/categorical.py:2487: FutureWarning: The `inplace` parameter in pandas.Categorical.remove_unused_categories is deprecated and will be removed in a future version. res = method(*args, **kwargs)